VAST Contest, UC Davis Entry

VIDI Research Group, University of California, Davis

VAST 2010 Challenge
Hospitalization Records - Characterization of Pandemic Spread

Authors and Affiliations:

      Yuzuru Tanahashi, ytanahashi@ucdavis.edu [PRIMARY contact]
      Yingcai Wu, ycwu@ucdavis.edu
      Kwan-Liu Ma, ma@cs.ucdavis.edu
      VIDI Research Group, University of California, Davis

Tool(s):

For this mini challenge, we have implemented an interactive visualization tool using Netzen, a visual network analysis system created by the VIDI Research Group at the University of California, Davis.

To analyze the data efficiently, we added line charts and stack graphs to the system. We selected these two plot types because each suits a different part of the analysis: line charts for temporal trends, and stack graphs for the composition of symptoms. The line charts include an interactive smoothing function that approximates the trajectory of each line by averaging each point with its neighboring values. With this feature, we could easily move between the smoothed, simplified data and the actual, precise data. The stack graphs include a function that lets users highlight features of the stacks by changing their opacity. The feature to highlight, such as the width of a stack or the change in its width, is configurable by the user. This function significantly reduced the time needed to identify the key symptoms of the disease.
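The smoothing described above can be sketched as a repeated neighbor average. This is a minimal illustration, not the Netzen implementation: the function name `smooth`, the window of one neighbor on each side, and the `passes` parameter (standing in for the interactive smoothing level) are all assumptions.

```python
def smooth(values, passes=1):
    """Average each point with its immediate neighbors, `passes` times.

    Assumption: a window of one neighbor on each side. Points at either
    end of the series are averaged over the neighbors that exist.
    """
    result = list(values)
    for _ in range(passes):
        smoothed = []
        for i in range(len(result)):
            lo = max(0, i - 1)
            hi = min(len(result), i + 2)
            window = result[lo:hi]
            smoothed.append(sum(window) / len(window))
        result = smoothed
    return result


# A single spike is spread over its neighbors after one pass.
print(smooth([0, 0, 9, 0, 0]))
```

Setting `passes=0` returns the actual data unchanged, and larger values give progressively simpler curves, which mirrors moving between the precise and the approximated views.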

It took us about two weeks to implement these plotting features and enhancements. The resulting tool allows us to characterize the spread of a disease more easily.

Video:

[.mov] (66.9MB)
This video demonstrates the interactive exploration of our analysis.

ANSWERS:

MC2.1: Analyze the records you have been given to characterize the spread of the disease. You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease. Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak. They are looking for visualization tools that will save them analysis time so they can react quickly.

Filtering the data :

Because the data set is large and has gaps (e.g., Thailand, Turkey, and Venezuela are each missing one day of data), we compiled the data into a daily percentage format. After this compilation, we were left with 75 records of daily data, from April 16th to June 29th 2009, for each area. In this percentage data, we assigned each symptom two values: HS (Hospitalize Symptom) and DP (Death Probability). HS is the percentage of patients with the given symptom out of the total number of patients. DP is the mortality rate of patients with the given symptom. We did not find any significant results when we analyzed the data by gender and age. Therefore, we did not consider these attributes in the rest of the study.
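The two per-symptom values defined above can be computed with a small aggregation like the one below. It is a sketch under stated assumptions: the `(date, symptom, died)` record layout and the function name `daily_hs_dp` are hypothetical and do not reflect the challenge's file format or our actual preprocessing code.

```python
from collections import defaultdict


def daily_hs_dp(records):
    """Return {(date, symptom): (hs, dp)}, both values in percent.

    Assumption: `records` is an iterable of (date, symptom, died) tuples,
    one per hospitalized patient. This layout is for illustration only.
    """
    admitted = defaultdict(int)  # (date, symptom) -> number of patients
    deaths = defaultdict(int)    # (date, symptom) -> patients who died
    total = defaultdict(int)     # date -> all patients admitted that day
    for date, symptom, died in records:
        admitted[(date, symptom)] += 1
        total[date] += 1
        if died:
            deaths[(date, symptom)] += 1

    result = {}
    for (date, symptom), count in admitted.items():
        hs = 100.0 * count / total[date]              # share of that day's patients
        dp = 100.0 * deaths[(date, symptom)] / count  # mortality for the symptom
        result[(date, symptom)] = (hs, dp)
    return result


sample = [
    ("2009-05-01", "FEVER", True),
    ("2009-05-01", "FEVER", False),
    ("2009-05-01", "COUGH", False),
    ("2009-05-01", "COUGH", False),
]
print(daily_hs_dp(sample))
```

Working with daily percentages rather than raw counts keeps areas of very different size comparable.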

Analyzing the disease :

Figure 1 shows line charts of the mortality rates for each area. From these charts, we can instantly notice that all areas, except Thailand and Turkey, have a large rise in mortality rates.

Figure 1. (a):Line chart of mortality rates for all areas. Horizontal axis represents date. Vertical axis represents mortality rates. (b):Same chart as (a) with smoothed lines. (c):Same chart as (b) with further smoothed lines.

From Figure 1, we can also see that the epidemic began around April 23rd (circled in red) and reached its peak around May 15th (circled in blue), about three weeks after the outbreak. Depending on the area, recovery took about three to four weeks.

Next, to analyze the symptoms of the disease, we created stack graphs of the HS data, shown in Figure 2. By highlighting HS rates, we can instantly see that the key symptoms of the disease were "ABDOMINAL PAIN", "DIARRHEA", "FEVER", and "VOMITING". Although "BACK PAIN" also seems to be one of the key symptoms, its stack width is only about half that of the others. Therefore, for simplicity, we concentrate on these four major symptoms.

Figure 2. Stack graphs of HS data. X-axis: Date. Y-axis(Width): HS value. The opacity is set to highlight the change in width of a stack. "ABDOMINAL PAIN", "BACK PAIN", "DIARRHEA", "FEVER", and "VOMITING" are highlighted in most of the graphs.
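The opacity setting used in Figure 2 can be illustrated as a mapping from each stack's change in width to an opacity value. The sketch below is an assumption about how such a mapping might work, not the tool's actual code: the function name, the use of the largest day-to-day change, and the `min_opacity` floor are all hypothetical.

```python
def opacity_by_width_change(widths, min_opacity=0.1):
    """Map each stack to an opacity in [min_opacity, 1.0].

    Assumptions: `widths` is {symptom: [daily HS values]}, and a stack's
    score is its largest absolute day-to-day change in width. Scores are
    scaled linearly so the most variable stack is fully opaque.
    """
    changes = {}
    for name, series in widths.items():
        steps = (abs(b - a) for a, b in zip(series, series[1:]))
        changes[name] = max(steps, default=0.0)

    largest = max(changes.values(), default=0.0)
    if largest == 0:
        # No stack changes at all: nothing to highlight.
        return {name: min_opacity for name in changes}
    return {
        name: min_opacity + (1 - min_opacity) * change / largest
        for name, change in changes.items()
    }


print(opacity_by_width_change({"FEVER": [1, 1, 9], "COUGH": [2, 2, 2]}))
```

Replacing the change-based score with the stack's width itself would give the other highlighting mode mentioned in the tool description.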

Having identified the key symptoms, we can analyze the disease further by filtering out the other, less important symptoms. Figure 3 shows the stack graphs of the DP data for the key symptoms.

Figure 3. Stack graph of DP data of "ABDOMINAL PAIN", "DIARRHEA", "FEVER", and "VOMITING". The blue lines are indicators for May 30th. The red boxes are indicators of the onset of the epidemic.

From the red boxes in Figure 3, we can see that in all areas except Thailand and Turkey, it takes about three to five days for the DP values to reach their peak. We believe this indicates that the disease had an incubation period of about one to five days. Figure 3 also shows that the DP values of the symptoms tend to remain stable at around 12.0 during the epidemic. Since DP is the mortality rate in percent, this indicates that patients hospitalized with the disease had about an 88.0 percent chance of survival.

MC2.2: Compare the outbreak across cities. Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities. Identify any anomalies you found.

Comparing areas :

By comparing the onset timings in the graphs shown in Figure 1, we can see that the epidemic first started in Nairobi, then immediately spread to the Middle East (Aleppo, Karachi, Saudi Arabia, and so on) and South America (Colombia, Venezuela). The red boxes in Figure 3 also validate this observation. Although most areas in the Middle East were exposed to the epidemic, no outbreak occurred in Turkey. Thailand also showed no sign of the epidemic throughout the period.

Figure 4. (a):Line chart of patients for all areas. Horizontal axis represents date. Vertical axis represents the number of patients normalized by the number of patients on a non-epidemic day. This baseline was approximated for each area by averaging the daily patient counts between the 5th and 20th percentiles. (b):Same chart as (a) with smoothed lines. (c):Same chart as (b) with further smoothed lines.
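The normalization in Figure 4 can be sketched as follows. This is one possible reading of the baseline described in the caption, not the exact procedure: the function name, the way the percentile range is turned into list indices, and the error handling are assumptions.

```python
def normalize_by_baseline(counts, low=0.05, high=0.20):
    """Divide daily patient counts by an estimate of a non-epidemic day.

    Assumption: the baseline is the mean of the sorted daily counts that
    fall between the `low` and `high` percentile positions.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    ordered = sorted(counts)
    n = len(ordered)
    start = int(n * low)
    stop = max(start + 1, int(n * high))  # always keep at least one value
    quiet_days = ordered[start:stop]
    baseline = sum(quiet_days) / len(quiet_days)
    if baseline == 0:
        raise ValueError("baseline is zero; cannot normalize")
    return [c / baseline for c in counts]


# Ten quiet days followed by ten days at twice the usual load.
print(normalize_by_baseline([10] * 10 + [20] * 10))
```

A value of 1.0 then means a typical quiet day for that area, so areas with very different hospital loads can share one vertical axis.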

In Figure 4, no temporal differences are visible when the actual data is simply plotted on a line chart, as in (a). With our smoothing function, however, we can easily observe that areas such as Lebanon, Venezuela, and Saudi Arabia did not have as significant a growth in the number of patients as areas such as Aleppo, Karachi, Nairobi, and Yemen.

Throughout the analysis, we did not find significant differences between areas in their recovery ability. However, in Figure 3 (blue vertical lines), we can observe that the DP values of almost all areas start to decrease from May 30th. In our opinion, this significant decrease indicates that a new way to treat the disease may have been found around that time.

Anomalies :

In Figure 1, there are several spikes of high mortality rates (circled in green). These spikes can also be seen in Figures 2 and 3. Because of their irregularity and extraordinarily high values, we believe these spikes may result from forged data.

Figure 5. Stack graphs of DP data of all symptoms. In this graph, the key symptoms of the disease are not highlighted.

Figure 5 shows the stack graphs of the DP data for all symptoms. From Figure 5, we can observe that the DP values of many symptoms rise during the epidemic. This indicates that patients hospitalized with other, miscellaneous symptoms also faced a higher mortality rate than usual. In addition, symptoms containing the keyword "VAGINAL" are highlighted in many of these graphs. After investigating symptoms such as "VAGINAL BLEEDING", we believe there is a good chance that they are strongly associated with HIV patients. Assuming this is true, Figure 5 also conveys the vulnerability of HIV patients to the disease and the size of that population in different areas.